On the 12th Day of Christmas, a Statistician Sent to Me…

Riley, R. D., Cole, T. J., Deeks, J., Kirkham, J. J., Morris, J., Perera, R. et al. (2022). On the 12th Day of Christmas, a Statistician Sent to Me. BMJ, 379, e072883.

The BMJ statistical editors' 2022 list of the 12 most common statistical faux pas they encounter when reviewing articles submitted to the journal (I am guilty of most of these myself).

Being proactive about these issues will not only make you a better clinical researcher, it will also increase your chances of a successful submission (to any journal).

On the 1st day of Christmas, a statistician sent to me:

Clarify the research question

  • What is your research question (aims/objectives)?
  • Is your research meant to:
    • Describe
      • Summarise the data.
      • Focus on descriptive statistics.
        • Means (± SDs), Medians (IQR), Counts (%s).
    • Explain (Why?)
      • Association vs Causation.
      • Focus on inferential statistics.
        • t-tests, \(\chi^{2}\) tests, statistical models.
    • Predict (You don’t care about the ‘Why?’)
      • Predict outcome from set of covariates.
      • Focus on maximising predictive power at expense of explanation.
        • Statistical models ± inference.
  • What is the outcome, and what is the target quantity to be estimated (statistically, the ‘estimand’)?
  • Does knowing the outcome address your research question?
    • i.e. is the answer aligned with the question?
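
For the descriptive case, a toy illustration (data entirely hypothetical) of the summaries listed above:

```python
import numpy as np

# Hypothetical sample of body weights (kg)
weights = np.array([62.0, 71.5, 80.2, 68.4, 90.1, 75.3, 66.8])

mean, sd = weights.mean(), weights.std(ddof=1)       # for roughly symmetric data
median = np.median(weights)
q1, q3 = np.percentile(weights, [25, 75])            # IQR, better for skewed data

print(f"Mean (SD): {mean:.1f} ({sd:.1f})")
print(f"Median (IQR): {median:.1f} ({q1:.1f}-{q3:.1f})")
```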

On the 2nd day of Christmas, a statistician sent to me:

Focus on estimates, confidence intervals, and clinical relevance

  • Statistical significance \(\neq\) clinical significance.

    • If your sample size is large enough, even small effects are statistically significant!
      • e.g. HR = 0.97 (95% C.I. 0.95, 0.99; p < 0.05).
        • Is this still important clinically? (maybe it is).

Conversely:

  • Absence of Evidence \(\neq\) Evidence of Absence.

    • If your sample size is small, clinically relevant effects may be ignored when they shouldn’t be!
      • e.g. HR = 0.70 (95% C.I. 0.40, 1.10; p > 0.05).
        • Is this still important clinically? (maybe it is).

Remember:

  • The point estimate is the ‘best guess’ of the population parameter of interest.
  • The 95% CI provides a range of values for the population parameter that are consistent with our data.
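
The first example above can be rebuilt from first principles. A minimal sketch (the standard error is a hypothetical value chosen to reproduce the quoted interval) of recovering a hazard ratio's 95% CI and p value from the log scale:

```python
import math

# Hypothetical SE, chosen so the interval matches the quoted 0.95 to 0.99
log_hr = math.log(0.97)
se = 0.0105

lo = math.exp(log_hr - 1.96 * se)
hi = math.exp(log_hr + 1.96 * se)
z = log_hr / se
p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))   # normal approximation

print(f"HR = 0.97 (95% C.I. {lo:.2f}, {hi:.2f}); p = {p:.3f}")
# Statistically significant, yet the effect may or may not matter clinically
```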

On the 3rd day of Christmas, a statistician sent to me:

Carefully account for missing data

  • Missing values are the norm, not the exception.
  • Quantify and report the amount of missing data.
  • Explain how the missingness was handled in the analyses.
  • Missing data mechanisms/assumptions:
    • e.g. In your study you wish to model the association between self-reported body weight (Y) and sex (X). Some responses aren’t complete, because:
      • some individuals simply didn’t see the question (Missing Completely at Random [MCAR] - “What you hope for”).
      • females in general were less willing to reveal their weight than males (Missing at Random [MAR] - “Second best”).
      • obese individuals in general were less willing to reveal their weight (Missing Not at Random [MNAR] - “The worst”).
    • Very hard to identify the mechanism in any given dataset.
  • Missing data approaches:
    • Ignore.
      • Complete case analysis - unbiased only under MCAR.
    • Impute.
      • Single:
        • Mean/Median substitution - biased.
        • Last observation carried forward (LOCF) - biased.
      • Multiple - unbiased under MCAR/MAR.
  • How little and how much?
    • Consider MI with as little as 5% missingness.
    • Upper limit depends on data - if you have many variables with minimal missingness, one with 80% missing should theoretically be ok with MI.
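
A numpy-only sketch of the weight/sex example above, with all data simulated and coefficients hypothetical. It uses stochastic regression imputation as a simplified stand-in for full multiple imputation (a proper MI would also propagate parameter uncertainty between draws):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
sex = rng.integers(0, 2, n)                      # X: 0 = male, 1 = female
weight = 80 - 10 * sex + rng.normal(0, 8, n)     # hypothetical true model

# MAR: females less willing to report their weight than males
miss = rng.random(n) < np.where(sex == 1, 0.4, 0.05)
y = np.where(miss, np.nan, weight)

obs = ~np.isnan(y)
diffs = []
for m in range(20):                              # 20 imputed datasets
    # Fit weight ~ sex on the observed cases
    X = np.column_stack([np.ones(obs.sum()), sex[obs]])
    beta, res, *_ = np.linalg.lstsq(X, y[obs], rcond=None)
    sigma = np.sqrt(res[0] / (obs.sum() - 2))
    # Impute each missing value as prediction + random noise
    y_imp = y.copy()
    y_imp[~obs] = beta[0] + beta[1] * sex[~obs] + rng.normal(0, sigma, (~obs).sum())
    diffs.append(y_imp[sex == 1].mean() - y_imp[sex == 0].mean())

# Pool the analysis estimates across imputed datasets (Rubin's rules)
print(f"Pooled female-male difference: {np.mean(diffs):.1f} kg")
```

Because the missingness depends only on the fully observed covariate (sex), the imputation model recovers an approximately unbiased estimate of the true −10 kg difference.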

On the 4th day of Christmas, a statistician sent to me:

Do not dichotomise continuous variables

  • There is no good reason to do this, so don’t do it.
  • e.g. Dichotomise systolic blood pressure at 130 mmHg.
  • Wastes information on between-subject variability.
    • 129 mmHg and 131 mmHg are considered different when really the same.
    • 131 mmHg and 220 mmHg are considered the same when really different.
  • Loss of statistical power.
    • Need a larger sample size to maintain statistical power equivalent to that for continuous data.
  • Where to dichotomise?
    • Can lead to data dredging and the ‘selection’ of optimal cut-points to maximise statistical significance.
  • Non-linear relationships may be masked by dichotomisation.
  • Dichotomisation of confounders does not fully correct for confounding of an exposure-outcome association -> residual confounding.
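
The loss of information can be seen in a small simulation sketch (numbers hypothetical) of splitting systolic blood pressure at 130 mmHg:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
sbp = rng.normal(130, 15, n)                      # systolic BP, continuous
outcome = 0.05 * sbp + rng.normal(0, 1, n)        # linear association with outcome

r_cont = np.corrcoef(sbp, outcome)[0, 1]
r_dich = np.corrcoef((sbp >= 130).astype(float), outcome)[0, 1]

print(f"Correlation using continuous SBP: {r_cont:.2f}")
print(f"Correlation after dichotomising:  {r_dich:.2f}")
# The dichotomised association is attenuated (classically ~0.8x at a median
# split), so a larger sample is needed to maintain the same statistical power
```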

On the 5th day of Christmas, a statistician sent to me:

Consider non-linear relationships

  • Continuous covariates may have linear OR non-linear association with the outcome.
    • Assuming linearity is a stronger assumption and is often wrong.
    • More realistic to assume non-linearity as a starting point and test if this is necessary.
  • A linear association assumes that a 1 unit increase in the covariate has the same ‘effect’ on the outcome across the range of covariate values.
    • e.g. HR for risk of relapse for change in age from 30 -> 31, same as for 90 -> 91.
    • Provides a simple interpretation.
  • A non-linear association allows the ‘effect’ of the covariate to vary across the range of covariate values.
    • HR for risk of relapse for change in age from 30 -> 31, differs compared to change in age for 90 -> 91.
  • How to model non-linearity:
    • Linear splines - different linear functions across range of covariate values.

      • Better than assuming complete linearity but still restrictive.
    • Regular polynomials - quadratics (\(x^2\)), cubics (\(x^3\)), etc (integer powers).

      • Disadvantage - ‘global’ effects. One point can influence the entire curve.
    • Fractional polynomials - fractional powers (e.g. \(x^{2/3}\))

    • Restricted cubic splines (RCS) - cubic polynomials joined smoothly at ‘knots’, restricted to be linear beyond the outer knots.

      • Models non-linearity nicely. Can capture local effects without one point influencing the entire curve.
      • Disadvantage - interpretation.
  • RCS interpretation:
    • Visualise - plot model predictions.
    • Consider specifying effect at salient values of covariate.
      • e.g. HR for one-year change in age at 30 yrs, vs HR for one-year change in age at 90 yrs.
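
A minimal numpy sketch of an RCS basis in Harrell's truncated-power form (data and knot positions are hypothetical). The restriction makes the fitted curve exactly linear beyond the outer knots:

```python
import numpy as np

def rcs_basis(x, knots):
    """Non-linear basis columns of a restricted cubic spline (Harrell's
    truncated-power form): cubic between knots, linear beyond the outer knots."""
    x = np.asarray(x, dtype=float)
    k = np.asarray(knots, dtype=float)
    t_k, t_km1 = k[-1], k[-2]

    def pos3(u):                                     # (u)_+^3
        return np.clip(u, 0, None) ** 3

    cols = [pos3(x - t_j)
            - pos3(x - t_km1) * (t_k - t_j) / (t_k - t_km1)
            + pos3(x - t_k) * (t_km1 - t_j) / (t_k - t_km1)
            for t_j in k[:-2]]
    return np.column_stack(cols)

# Hypothetical data with a non-linear age effect
rng = np.random.default_rng(2)
age = rng.uniform(20, 90, 300)
y = np.sin(age / 15) + rng.normal(0, 0.2, 300)

knots = np.percentile(age, [5, 35, 65, 95])          # 4 knots -> 2 non-linear terms
X = np.column_stack([np.ones_like(age), age, rcs_basis(age, knots)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Interpret by predicting over a grid (plot these in practice)
grid = np.linspace(20, 90, 8)
Xg = np.column_stack([np.ones_like(grid), grid, rcs_basis(grid, knots)])
print(np.round(Xg @ beta, 2))
```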

On the 6th day of Christmas, a statistician sent to me:

Quantify differences in subgroup results

  • Sometimes you may be interested in estimating a treatment (or some other) ‘effect’ in subgroups (sex, disease status, etc).
  • There are two ways to do this:
    • One model with an interaction term between your treatment and subgroup variables (e.g. treatment * sex).
      • -> “The effect of treatment depends on sex”.
    • Two models (one for each sex).
      • -> Separate treatment effects.
  • The first is the better approach.
  • Consider example - Association between hormone replacement therapy (HRT) and non-vertebral fractures in women.
    • In subgroup women < 60 yrs, RR of fracture (HRT vs none) = 0.67 (95% C.I. 0.46, 0.98; p = 0.03).
    • In subgroup women ≥ 60 yrs, RR of fracture (HRT vs none) = 0.88 (95% C.I. 0.71, 1.08; p = 0.22).
  • Naive interpretation is to suggest HRT is beneficial for younger women but not for older women.
  • BUT, knowing the treatment effect in each group tells you NOTHING about whether the treatment effects themselves differ!
  • The interaction term quantifies this difference.
    • Ratio of risk ratios = 0.76 (95% C.I. 0.49, 1.17; p = 0.2).

    • Another Absence of Evidence \(\neq\) Evidence of Absence problem.

    • Older women may still benefit from HRT but we can’t say with certainty - possible loss of power from splitting the data and reducing the sample size.

    • The current data don’t support that conclusion.

  • It is not sufficient for the treatment effect to be significant in one subgroup and not the other - the interaction term must also be significant.
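
The interaction result in the HRT example can be recovered directly from the two subgroup intervals. A sketch, assuming the quoted CIs were built on the log scale with a normal approximation:

```python
import math

def se_from_ci(lo, hi):
    """SE of a log risk ratio, back-calculated from its 95% CI."""
    return (math.log(hi) - math.log(lo)) / (2 * 1.96)

rr_young, se_young = 0.67, se_from_ci(0.46, 0.98)   # women < 60 yrs
rr_old, se_old = 0.88, se_from_ci(0.71, 1.08)       # women >= 60 yrs

log_rrr = math.log(rr_young / rr_old)               # interaction on the log scale
se_rrr = math.sqrt(se_young**2 + se_old**2)
lo = math.exp(log_rrr - 1.96 * se_rrr)
hi = math.exp(log_rrr + 1.96 * se_rrr)
z = log_rrr / se_rrr
p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

print(f"Ratio of RRs = {rr_young/rr_old:.2f} (95% C.I. {lo:.2f}, {hi:.2f}); p = {p:.2f}")
# The interaction CI spans 1: the subgroup effects are not shown to differ
```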

On the 7th day of Christmas, a statistician sent to me:

Consider accounting for clustering

  • Don’t ignore clusters in your data - within- or between-individual.
  • Two main cases:
    • Repeated measurements within patients.
    • Patients within larger groupings (e.g. hospitals/GP clinics).
  • These give rise to correlated observations which can bias inference if not accounted for correctly.
  • Statistical tests/models are based on an assumption of independent observations - each observation provides ‘new’ information.
  • If you had to choose between the following 10 observations, which would you choose to maximise the statistical power in your study:
    • EDSS measured 10 times on one patient.
    • EDSS measured once on each of 10 patients.
  • If observations are correlated in some way, they no longer provide completely new information but are still treated as such.
  • This ‘tricks’ our statistical tests/models into thinking we have more information than we really do and may bias inference - usually in a false-positive kind of way (i.e. standard errors and p values erroneously low).
  • Solution - use an analytical approach that accounts for the non-independence:
    • Mixed-models.
    • Generalised estimating equations.
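
A simulation sketch (all numbers hypothetical) of why the naive SE is erroneously low when 10 measurements per patient are treated as if they were 100 independent observations:

```python
import numpy as np

rng = np.random.default_rng(3)
n_pat, n_rep = 10, 10

naive_se, means = [], []
for _ in range(2000):
    patient_effect = rng.normal(0, 1, n_pat)                   # between-patient variation
    y = patient_effect[:, None] + rng.normal(0, 0.5, (n_pat, n_rep))
    means.append(y.mean())
    # Naive SE pretends all 100 observations are independent
    naive_se.append(y.std(ddof=1) / np.sqrt(n_pat * n_rep))

print(f"Average naive SE:           {np.mean(naive_se):.3f}")
print(f"True SD of the sample mean: {np.std(means):.3f}")
# The naive SE is far too small -> p values too small -> false positives
```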

On the 8th day of Christmas, a statistician sent to me:

Interpret \(I^2\) and meta-regression appropriately

  • In meta-analyses, multiple smaller studies are combined to increase statistical power in order to estimate a treatment effect.

  • Ideally, all studies to be combined would be undertaken in the same way and to the same experimental protocol (e.g. study design, treatment regimen, inclusion criteria, etc).

  • Differences between study outcomes would then be due only to chance, and the studies would be considered homogeneous.

  • But, this isn’t the real world. Studies are never done exactly the same.

  • Therefore, we expect some between-study heterogeneity: variability in outcomes beyond that expected by chance.

  • If heterogeneity is high, should the studies even be combined?

  • One (the most common?) measure of heterogeneity is \(I^2\).

  • \(I^2\) describes the percentage of variability in (treatment) effect estimates that is due to between study heterogeneity rather than chance.

    • i.e. It tells us what proportion of the variance in observed effects reflects variance in true effects rather than sampling error.
  • It is a relative measure of heterogeneity and should NOT be interpreted as absolute.

    • -> 0% - studies more homogeneous (but the actual between-study variance may still be high - i.e. a small percentage of a large number may still be large).
    • -> 100% - studies more heterogeneous (but the actual between-study variance may still be low - i.e. a large percentage of a small number may still be small).
  • Meta-regression can be considered an extension of meta-analysis that uses regression to relate study effect estimates to available covariates, allowing findings from multiple studies to be combined, compared, and synthesised.

  • Covariates usually at the study level, rather than individual level (e.g. mean age, proportion men).

  • Usually underpowered.

  • Beware of aggregation bias -> ecological fallacy:

    • What applies at the group level may not apply at the individual level.
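
A minimal sketch of the \(I^2\) calculation via Cochran's Q (study estimates and standard errors are hypothetical):

```python
import numpy as np

theta = np.array([0.10, 0.30, 0.25, 0.05, 0.40])   # hypothetical log effect sizes
se = np.array([0.08, 0.10, 0.07, 0.12, 0.09])      # their standard errors

w = 1 / se**2                                      # inverse-variance weights
theta_fixed = np.sum(w * theta) / np.sum(w)        # fixed-effect pooled estimate
Q = np.sum(w * (theta - theta_fixed)**2)           # Cochran's Q
df = len(theta) - 1
I2 = max(0.0, (Q - df) / Q) * 100                  # a *percentage*, not an absolute
                                                   # between-study variance
print(f"Q = {Q:.1f} on {df} df; I^2 = {I2:.0f}%")
```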

On the 9th day of Christmas, a statistician sent to me:

Assess calibration of model predictions

  • Clinical prediction models estimate outcome values (for continuous outcomes) or outcome risks (for binary or time-to-event outcomes) to inform diagnosis and prognosis in individuals.
  • In development and validation of such models, a full evaluation of model performance is critical to ensure predictions are correct and patients are not inadvertently harmed (due to false positives or false negatives).
  • For models that estimate outcome risk, performance can be evaluated in terms of:
    • Discrimination (does the model separate those who develop the outcome from those who do not? - ROC curves/AUC/etc)
    • Calibration (do predicted probabilities align with observed proportions?)
    • Clinical utility (how will this benefit patients?)
  • Discrimination is a relative measure of risk and is commonly assessed.
  • Calibration is an absolute measure of risk and is often ignored.
  • A well-discriminating model can be miscalibrated and this can affect its clinical utility.
  • e.g. Consider a model with a high AUC (area under the curve).
    • Risk ratio may be correctly calculated as 5 - i.e. a 5-fold increased risk of the outcome in exposed vs unexposed groups.
    • BUT, calibration is poor because absolute risks of the outcome are predicted as 5% in exposed and 1% in unexposed, when in reality the risks are 50% and 10%.
  • May need to reassess model using penalisation/shrinkage methods which reduce chance of overfitting.
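
A numpy-only sketch (data hypothetical), echoing the example above: the model's ranking, and hence its discrimination, is untouched, yet its predicted risks are five-fold too low, which a decile-based calibration check exposes:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
true_p = rng.beta(2, 5, n)            # true outcome risks
y = rng.random(n) < true_p            # observed binary outcomes

# Miscalibrated predictions: risks 5x too low, but pred is a monotone
# transform of true_p, so ranking (and hence discrimination/AUC) is unharmed
pred = true_p / 5

# Calibration by decile of predicted risk: predicted vs observed
edges = np.quantile(pred, np.linspace(0, 1, 11))
bins = np.clip(np.digitize(pred, edges[1:-1]), 0, 9)
for b in (0, 3, 6, 9):
    m = bins == b
    print(f"decile {b}: predicted {pred[m].mean():.2f}, observed {y[m].mean():.2f}")
```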

On the 10th day of Christmas, a statistician sent to me:

Carefully consider the variable selection approach

  • There is increasing awareness of the disadvantages in using data-driven variable selection methods - i.e. selecting variables based on their statistical significance (p value). This applies in both:
    • Univariable screening of ‘candidate’ variables.
    • Multivariable model building - backwards elimination, forward selection, etc.
  • There is no issue if you prespecify a single model and then estimate that model from your sample of data.
  • The issue is when you test many different models and present a ‘final’ model as if it were prespecified.
  • This tends to result in standard errors and p values that are underestimated (too small) and regression coefficients that overstate their true relationship with the outcome (too large).
  • So what should we do? If we want best practice:
  • For explanatory modelling (associations/causation):
    • Use theory, expert knowledge and the available literature to prespecify a model.
    • In addition, analyses intended to claim causal effects should be supported by DAGs (directed acyclic graphs) in which ALL potential confounders - and their relationships with the exposure, outcome, and each other - are thoroughly documented.
  • For predictive modelling:
    • Variable selection (through shrinkage) may be incorporated using methods such as lasso or elastic net, which start with a full model including all candidate predictors for potential inclusion.
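
A hedged sketch of shrinkage-based selection, assuming scikit-learn is available (data and predictor counts are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(5)
n, p = 300, 10
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 1, n)   # only 2 true predictors

# Start from the full model with all candidate predictors; the lasso
# penalty shrinks uninformative coefficients to exactly zero
model = LassoCV(cv=5, random_state=0).fit(X, y)
selected = np.flatnonzero(model.coef_ != 0)
print("Predictors kept by the lasso:", selected)
```

The cross-validated penalty typically retains the informative predictors while zeroing out most of the noise, without any p-value-driven stepwise search.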

On the 11th day of Christmas, a statistician sent to me:

Assess the impact of any assumptions

  • Have you considered sensitivity analyses?
  • What is a sensitivity analysis?
    • A method to determine the robustness of an assessment by examining the extent to which results are affected by changes in methods, models, values of unmeasured variables, or assumptions.
  • A sensitivity analysis can help to answer the following questions:
    • How confident can I be about the results?
    • Will the results change if I change the definition of the outcome (e.g., using different cut-off points)?
    • Will the results change if I change the method of analysis?
    • Will the results change if we take missing data into account? Will the method of handling missing data lead to different conclusions?
    • How much influence will minor protocol deviations have on the conclusions?
    • What if the data were assumed to have a non-normal distribution, or there were outliers?
    • What if the hazard ratio isn’t constant over time?

On the 12th day of Christmas, a statistician sent to me:

Use reporting guidelines to avoid overinterpretation

  • “Readers should not have to infer what was probably done, they should be told explicitly.”
  • Readers should not have to search for, or be unsure about:
    • The rationale and objectives of a reported study.
    • The study design.
    • Details of the methods used.
    • Participant characteristics.
    • Results.
    • Certainty of evidence.
    • Research implications.
  • Make use of the appropriate reporting guidelines:
    • Provide a checklist of items to be reported which represent the minimum detail required to enable readers to understand the research and critically appraise its findings.
  • Randomised trials:
    • CONSORT (consolidated standards of reporting trials).
  • Observational studies:
    • STROBE (strengthening the reporting of observational studies in epidemiology).
  • Systematic reviews:
    • PRISMA (preferred reporting items for systematic reviews and meta-analyses).
  • Diagnostic test accuracy:
    • STARD (standards for reporting diagnostic accuracy studies).
  • Prediction model studies:
    • TRIPOD (transparent reporting of a multivariable prediction model for individual prognosis or diagnosis).
  • All are available at https://www.equator-network.org

The happy statistician’s Christmas carol can then become: